Add explicit_hydrogen parameter#741
Conversation
CodSpeed Performance ReportMerging #741 will not alter performanceComparing Summary
|
| if has_charge_annot: | ||
| rdkit_atom.SetFormalCharge(atoms.charge[i].item()) | ||
| if explicit_hydrogen: | ||
| rdkit_atom.SetNoImplicit(True) |
There was a problem hiding this comment.
In the base case where explicity_hydrogen=True, what would this mean if I just call to_mol for a typical crystal structure that won't have any hydrogens resolved? Will RDKit infer charges / valences in that case? If so this might lead to broken molecules -- if that were the case, should we check that if explicit hydrogen is true there have to be hydrogens in the structure? (I could see this being sth users will try -- at least I would have tried it :D )
There was a problem hiding this comment.
Charges would also not be inferred automatically before this PR. But I am not sure what this means for valence: As the bond types are also explicitly set I guess RDKit assumes a radical, if explicit_hydrogen=True, but hydrogen atoms are actually missing. Probably, I should test this.
Having a check there sounds like a good idea, especially as I would agree that this could be a common mistake. However, I also think strictly checking for the simple presence of hydrogen atoms might not be sensible enough, as there are valid molecules without hydrogen atoms, although they appear rarely. What do you think about raising a warning as a 'reminder' to the user to check the input?
|
I checked the different behaviors when an import biotite.interface.rdkit as rdkit_interface
import biotite.structure.info as info
import rdkit.Chem.AllChem as Chem
import rdkit.Chem.Draw as Draw
atoms = info.residue("C")
atoms = atoms[atoms.element != "H"]
mol = rdkit_interface.to_mol(atoms, explicit_hydrogen=True)
Chem.Compute2DCoords(mol)
Draw.MolToFile(mol, "explicit.png")
mol = rdkit_interface.to_mol(atoms, explicit_hydrogen=False)
Chem.Compute2DCoords(mol)
Draw.MolToFile(mol, "non_explicit.png")
So |
|
I added a warning in case the structure contains no hydrogen atoms. @Croydon-Brixton Could you have a look again? |
| HXT | ||
| """ | ||
| mol = EditableMol(Mol()) | ||
| hydrogen_mask = atoms.element == "H" |
There was a problem hiding this comment.
nit: I just remembered that I've seen the very occasional structures with deuterium as well (e.g. 1wq2). How would we want to deal with these cases? Do we want to treat this isotope as hydrogen as well (probably chemically most appropriate) or not? If so, the PDB represents deuterium as the element "D"
There was a problem hiding this comment.
This is indeed an important edge case! I agree, that deuterium should be handled like hydrogen. However, this problem is not isolated to to_mol() but involves difference places in the code base. Hence I added a separate issue (#758) and propose to handle this afterwards.
There was a problem hiding this comment.
I also thought about it again, and I would prefer not removing hydrogen atoms, if explicit_hydrogen=False, but raise an exception instead. This would ensure that the atom indices of the created Mol always corresponds to the atom indices of the input AtomArray. This would for example matter when using indices obtained from Mol.GetSubstructMatch() to filter the AtomArray.
src/biotite/interface/rdkit/mol.py
Outdated
| if explicit_hydrogen: | ||
| if not hydrogen_mask.any(): | ||
| warnings.warn( | ||
| "No hydrogen found in the input, although 'explicit_hydrogen' is 'True'" |
There was a problem hiding this comment.
| "No hydrogen found in the input, although 'explicit_hydrogen' is 'True'" | |
| "No hydrogen found in the input, although 'explicit_hydrogen' is 'True'. " | |
| "This may lead to atoms wrongly interpreted as radicals. Be careful." |
| rdkit_atom = Atom(atoms.element[i].capitalize()) | ||
| rdkit_atom = Chem.Atom(atoms.element[i].capitalize()) | ||
| if explicit_hydrogen: | ||
| rdkit_atom.SetNoImplicit(True) |
There was a problem hiding this comment.
Hmm to my mind this option is still problematic, at least given the defaults of the to_mol function:
By default most PDBs won't have any hydrogens present, but the default in to_mol here is explicit_hydrogen=True. This will lead to radicals upon calling Chem.Sanitize, which is undesirable almost always and will probably catch a lot of users (especially those less well versed in RDKit) off guard.
I would suggest:
- Changing the default to not expect explicit hydrogens.
- Emitting a warning if a user says they're having explicit hydrogens, but no hydrogens are found, with a note that this can lead to radicals. After that, they're on their own ^^
There was a problem hiding this comment.
Probably you are right, that the more common case is explicit_hydrogen=False 👍.
|
Not directly related to the radicals, but another thing I noted while looking at this: |
|
@padix-key I've added my suggested changes here for your convenience: padix-key#13 |
|
Thanks @Croydon-Brixton for your support. Your suggested changes look basically good to me. I would have some minor suggestions, but it is probably less confusing, if I bring them up in this PR. |
|
I mainly added three changes:
|
4c7204a to
41a843f
Compare
… Also set coordinates of non-3d conformers to nan even when explicitly requested, as they are meaningless.
…g them as 'str' objects. Also type-cast the rdkit internal 'isImplicit' flag.
Co-authored-by: Simon Mathis <simon.mathis@gmail.com>
|
@Croydon-Brixton Do you think this PR is ready to merge? |
Croydon-Brixton
left a comment
There was a problem hiding this comment.
Oops, sorry the requested re-review somehow slipped through my fingers!
Yes, the changes look all good to me and from my point of view this is ready to merge. While reading over it again I just found one tiny typo.
- I like the new default of 'None' for our explicit hydrogen policy. I think inferring makes a lot of sense and the edge cases where this isn't desired (e.g. some small molecule that truly has no hydrogens) should be harmless, since re-inferring should conclude that indeed no hydrogens are needed.
- The selection of 2D & 3D conformers is very neat, thank you!!
- Thank you for finding a way to keep the annotation dtypes preserved where possible!
| conformer_id : int, optional | ||
| The conformer to be converted. | ||
| By default, an :class:`AtomArrayStack` with all conformers is returned. | ||
| conformer_id : int or {"2D", "3D"}, optional |
| assert test_annot.dtype == ref_annot.dtype | ||
| assert test_annot.tolist() == ref_annot.tolist() |
There was a problem hiding this comment.
Great to see the annotation dtypes preserved! Well done for finding a way to do that :D
|
Thanks again! As the failing test will be fixed by #761, I will merge this PR. |
* Add `explicit_hydrogen` parameter * Use canonical rdkit import * fix: enable returning molecules without conformers (with nan coords). Also set coordinates of non-3d conformers to nan even when explicitly requested, as they are meaningless. * feat: attempt to type-cast extra annotations instead of always leaving them as 'str' objects. Also type-cast the rdkit internal 'isImplicit' flag. * fix: suggestion for default case & warnings for explicit hydrogens * Update src/biotite/interface/rdkit/mol.py Co-authored-by: Simon Mathis <simon.mathis@gmail.com> * Infer `explicit_hydrogen` from presence of hydrogen atoms * Add author * Make 2D and 3D conformers selectable * Allow multiple types for extra annotations * Note that the atoms are in the same order * Fix typo Co-authored-by: Simon Mathis <simon.mathis@gmail.com> --------- Co-authored-by: Simon Mathis <simon.mathis@gmail.com>







This parameter in
interface.rdkit.to_mol()defines whether hydrogen should be explicitly or implicitly included in the createdMol